The Covid event created new incentives for city dwellers to move from condos to detached houses, either within or outside Toronto.
In this context we propose a simple tool for exploring the property transactions that have taken place in Toronto, with two purposes:
estimating the market value of a property in Toronto given its location and other user-defined features
and, given a budget and other constraints, visualizing the properties sold.
At the moment, given the project time constraint, we have been able to web scrape the latest available listings on Zoocasa but did not have sufficient time to clean and process the data. Instead, to illustrate the business case solution, we used a pre-existing dataset, available as of H2 2019, which was already cleaned and processed.
The current project builds upon a pre-existing project on Toronto housing price prediction available at https://github.com/slavaspirin/Toronto-housing-price-prediction
The dataset used from the pre-existing project is available at https://github.com/slavaspirin/Toronto-housing-price-prediction/raw/master/houses_edited.csv
The data from the pre-existing project is based on web scraping of Zoocasa listings of previously sold properties. Unfortunately, the data does not have a time stamp. We understand that the primary listing data was scraped from https://www.zoocasa.com and contains a list of sold properties available as of some point in H2 2019.
We were able to scrape a scoring of Toronto neighborhoods from https://torontolife.com/neighbourhood-rankings/ to complement the average 2016 personal income data of each district that was already available in the pre-existing project.
Toronto neighborhoods data for geographical mapping was available from https://open.toronto.ca/dataset/neighbourhoods/.
We were also able to obtain location data for the Toronto subway stations from https://scruss.com/blog/2005/12/14/toronto-subway-station-gps-locations/#comments. For the line currently under construction, Line 5 Eglinton, the stations' latitude and longitude were obtained from https://en.wikipedia.org/wiki/Line_5_Eglinton.
The data available includes 15234 listings of Toronto properties with the following available features.
| Variable Name | Description |
|---|---|
| title | text, Zoocasa short description of the listing |
| final price | numeric, sale price |
| listed price | numeric, listed price |
| bedrooms | text ordinal, 0 beds, 0 + 1 beds, 1 beds … 9 + 5 beds |
| bedrooms>grade | numeric, number of bedrooms above grade |
| bedrooms<grade | numeric, number of bedrooms below grade |
| bathrooms | numeric, 1 to 11 |
| sqft | numeric, between 259 and 4374, or missing |
| description | text, Zoocasa long description of the listed property |
| mls | text, Zoocasa identifier |
| type | text categorical, Att/Row/Twnhouse, Comm Element Condo, Condo Apt, Condo Townhouse, Co-Op Apt, Co-Ownership Apt, Detached, Link, Plex, Semi-Detached, Store W/Apt/Offc |
| full link | text, Zoocasa web link |
| lat | numeric, property location latitude |
| long | numeric, property location longitude |
| city district | text, Toronto city district |
| district code | numeric, Toronto city district identifier code |
| mean district income | numeric, Toronto city district average household income based on 2016 statistics |
Based on “bedrooms>grade” and “bedrooms<grade” we created an aggregated bedrooms feature, calculated as “bedrooms>grade” + “bedrooms<grade”/2, to account for the smaller size of below-grade bedrooms.
Based on “listed price” and “final price” we created a “price differential” feature, calculated as “final price”/“listed price” - 1. Even though such a feature is not necessarily useful for predicting the sale price, it is informative about the pricing error that property sellers have encountered, which sellers may wish to take into account when listing a property.
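The two derived features described above can be sketched in a few lines. This is only an illustration of the stated formulas: the function names and the sample listing are hypothetical and do not come from the project code.

```python
def aggregated_bedrooms(above_grade, below_grade):
    """Above-grade bedrooms count fully; below-grade bedrooms count for half."""
    return above_grade + below_grade / 2


def price_differential(final_price, listed_price):
    """Relative gap between sale and listing price: final / listed - 1."""
    return final_price / listed_price - 1


# Hypothetical listing: 3 bedrooms above grade, 1 below grade,
# listed at 800,000 and sold at 840,000.
print(aggregated_bedrooms(3, 1))                        # 3.5
print(round(price_differential(840_000, 800_000), 4))   # 0.05
```

A positive price differential means the property sold above its listed price; a negative one means it sold below.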
Based on location data for the Toronto subway stations (including Line 5 Eglinton) we were able to identify the closest subway station to each property and to estimate, assuming an average walking speed of 5 km/h, the walking distance (in minutes) to that station.
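One way such a feature could be computed is sketched below. It assumes straight-line (great-circle) distance via the haversine formula, which the text does not specify, so the real calculation may differ. The station names and coordinates are placeholders, not values from the project data.

```python
import math

EARTH_RADIUS_KM = 6371.0
WALKING_SPEED_KMH = 5.0  # average walking speed stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, long) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_station(lat, lon, stations):
    """Return (station name, walking minutes) for the nearest station.

    `stations` maps a station name to its (lat, long) pair.
    """
    name, (s_lat, s_lon) = min(
        stations.items(),
        key=lambda item: haversine_km(lat, lon, item[1][0], item[1][1]),
    )
    km = haversine_km(lat, lon, s_lat, s_lon)
    return name, km / WALKING_SPEED_KMH * 60


# Placeholder coordinates for illustration only.
stations = {"Station A": (43.6450, -79.3800), "Station B": (43.7060, -79.3980)}
name, minutes = closest_station(43.6500, -79.3800, stations)
print(name, round(minutes, 1))
```

Straight-line distance understates the true walking route, so the resulting minutes are best read as a lower bound.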
| Variable Name | Description |
|---|---|
| district code | numeric, Toronto city district identifier code |
| area name | text, Toronto city district name |
| description | text, description of the district |
| housing score | numeric, score based on affordability (cost vs. income), appreciation (yoy change) and rate of home ownership |
| safety score | numeric, score based on number of crimes |
| transit score | numeric, score based on number of TTC stops, walk and transit scores, commuting times, numbers of commuters who walk, cycle or take TTC |
| shopping score | numeric, score based on number of groceries, markets and pharmacies per km2 |
| health score | numeric, score based on number of medical and mental health services per capita, number of senior care services per senior, number of people with family doctors and physical activity levels among residents |
| entertainment score | numeric, score based on number of gyms, sport facilities, bars and restaurants per km2 |
| community score | numeric, score based on voter turnout, community space use per capita, how many people report a sense of community belonging |
| education score | numeric, score based on number of schools per child, number of daycares per child, share of residents with post-secondary education |
| diversity score | numeric, score based on % of visible minorities, people whose mother tongues are not French or English, and first- and second-generation immigrants |
| employment score | numeric, score based on employment and unemployment rates, the share of residents below the poverty line, the share of high-income residents and the share of self-employed residents |
The data pertaining to housing listings was cleaned, aggregated and readily available at https://github.com/slavaspirin/Toronto-housing-price-prediction/raw/master/houses_edited.csv
The histograms of sale prices and log sale prices for condos, townhouses, detached and semi-detached houses (these inclusion criteria capture 94% of our listings) indicate that the log transform shifts the distribution closer to a Gaussian one. This will allow us to better model the listings with prices around the mode of the distribution and below.
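The effect of the log transform can be illustrated by comparing the sample skewness before and after the transformation. The prices below are made up for the example and are not taken from the dataset; the `skewness` helper is likewise an illustrative assumption.

```python
import math
import statistics


def skewness(values):
    """Sample skewness: mean cubed deviation over the cubed standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)


# Made-up sale prices with a long right tail, for illustration only.
prices = [420_000, 480_000, 510_000, 560_000, 640_000,
          790_000, 1_150_000, 2_400_000]
log_prices = [math.log(p) for p in prices]

print(round(skewness(prices), 2), round(skewness(log_prices), 2))
```

A skewness closer to zero after the transform indicates a more symmetric distribution, although it does not by itself establish normality.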
One can also observe that the sale price relative to the listed price has a positively skewed distribution when above zero, meaning that sellers were getting more than the listed price. Nevertheless, the very high values observed (a 50% markup) make us less inclined to use the listed price in our analysis. It is possible that listed prices are biased by sellers seeking to increase the marketability of the property and gather more offers during the listing period.
An interesting property of condos relates to the subway walking distance feature. Condos are concentrated within 39 minutes of the closest subway station, as the corresponding histogram drops abruptly around the 39-minute threshold.
The Toronto district scores seem to be designed to be uniformly distributed, in contrast with the district average income which seems to be non-uniform.
One can observe the following strong correlations:
Given that the district average income has a different distribution than the district scores, we computed the Spearman (rank) correlation.
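To make the choice of rank correlation concrete, the sketch below computes the Spearman correlation as the Pearson correlation of the ranks. The helper names and the income and score values are illustrative assumptions, not project data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ranks(values):
    """Rank values from 1 to n, giving ties the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative values: one very wealthy district and a bounded score.
income = [30_000, 45_000, 60_000, 250_000]
score = [10, 40, 55, 90]
print(round(pearson(income, score), 3), round(spearman(income, score), 3))
```

Because the relationship here is monotone but not linear, the rank correlation is perfect while the Pearson coefficient is pulled down by the outlying income.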
We have also computed the Pearson correlations and observed that the most extreme value was 0.83 (shopping vs. entertainment scores). As such, these correlations would not pose problems in a linear regression (i.e. the X’X matrix is invertible).
We have computed the Pearson correlations to observe how all property and location features are interconnected. As expected, the size of the property influences the sale price. In terms of district location, the district average income and the scores for safety, diversity and employment are also significantly correlated with the sale price (at the 1% significance level).
Unexpectedly, the “subway walking distance” has a small negative correlation with the final sale price. One reason could be that this relationship differs across property types. For condos, subway closeness is much more important than for detached houses. For detached houses, the size of the property, and implicitly the price, is higher the further away one gets from the subway lines.
We note that the most extreme correlation values are below 0.85 (excluding the “bedrooms>grade” vs. “Bedrooms Agg” correlation). As such, these correlations would not pose problems in a linear regression (i.e. the X’X matrix is invertible).
As expected, there is a clear dependency between the type of house and the sale price. Condos have a price distribution centered at a lower level than Detached, Plex and Semi-Detached.
As expected, there is a clear increasing relationship between the number of bedrooms and the sale price.
As expected, there is a clear increasing relationship between the number of bathrooms and the sale price.
There is a clear increasing relationship between the surface area of the property and the sale price. One can observe that many of the properties sold have missing square footage data, which will need to be imputed.
There is no clear relationship between the distance to the closest subway and the sale price. This might be due to an uneven mix of property types which depends on subway proximity. For condos, subway closeness is much more important than for detached houses. For detached houses, the size of the property, and implicitly the price, is higher the further away one gets from the subway lines.
There is a clear increasing relationship between the district wealth level and the sale price.
Although there is no clear relationship between the neighbourhood transit score and the average level of the sale price, the shape of the sale price distribution varies visibly, which might be related to different mixes of property types depending on the district and the transit network.
There is no clear relationship between the district shopping score and the sale price.
Although there is no clear relationship between the neighbourhood healthcare score and the average level of the sale price, the shape of the sale price distribution varies visibly, which might be related to different mixes of property types depending on the district and the healthcare network.
There is a clear ... relationship between the district healthcare entertainment and the sale price.
There is no clear relationship between the district healthcare entertainment and the sale price.
There is a clear decreasing relationship between the district community score and the sale price.
There is no clear relationship between the district education score and the sale price.
There is a clear increasing relationship between the district employment score and the sale price.
The following mappings attempt to provide an overview of the listings available as well as the relationship between districts and house prices.
One can observe that the listings are not uniformly concentrated across Toronto. With respect to housing price levels, both the detached and condo markets appear to track the district average income (Agincourt, the Beaches to Cliffcrest zone, Princess-Rosethorn), confirming the positive correlation between prices and district average income.
The data available sets us in a context where we need to estimate the price of a property given a set of numerical, ordinal and categorical features that describe the property.
Torsten and Robin, please input text here about how we did this
Torsten and Robin, please input text here to describe the model
Linear regression can accommodate both numerical and categorical explanatory variables (the categorical predictors are transformed into dummy variables).
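The dummy-variable transformation mentioned above can be sketched as follows. This is a minimal illustration rather than the project's code: the `dummy_encode` helper is hypothetical, and dropping a baseline level is a common convention assumed here so that the indicators are not collinear with the regression intercept.

```python
def dummy_encode(values, baseline=None):
    """Turn a categorical column into 0/1 indicator columns.

    One level (the baseline) is dropped so the indicators are not
    perfectly collinear with the intercept of the regression.
    """
    levels = sorted(set(values))
    if baseline is None:
        baseline = levels[0]
    kept = [level for level in levels if level != baseline]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return kept, rows


columns, rows = dummy_encode(["Condo Apt", "Detached", "Semi-Detached", "Condo Apt"])
print(columns)  # ['Detached', 'Semi-Detached']
print(rows)     # [[0, 0], [1, 0], [0, 1], [0, 0]]
```

Each coefficient on a dummy column is then read as the price difference relative to the baseline type, here "Condo Apt".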
To understand and assess the potential effect of dimensionality reduction, we employed three methods:
a linear regression which includes all explanatory variables without any dimensionality reduction method
a stepwise model for explanatory variable selection based on the Akaike Information Criterion (which depends on the model log-likelihood and penalizes the addition of parameters)
a lasso-ridge penalty approach, which adds a term to our likelihood function that penalizes large coefficients (their absolute value for lasso, their squared value for ridge). As such, only high-impact variables that make it through the penalty will be retained. To fine-tune the weight of the penalty function and ensure the result is robust to changes in the training data, we use 10-fold cross-validation.
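The role of cross-validation in choosing the penalty weight can be illustrated with a deliberately simplified sketch. It assumes a single centred predictor and a ridge-only penalty with a closed-form solution, which is much simpler than the lasso-ridge model described above. The simulated data, the penalty grid and all function names are assumptions made for the example.

```python
import random


def ridge_slope(x, y, penalty):
    """Closed-form ridge estimate for one centred predictor, no intercept."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + penalty)


def cv_error(x, y, penalty, folds=10):
    """Mean squared prediction error under k-fold cross-validation."""
    n = len(x)
    total = 0.0
    for fold in range(folds):
        test_idx = set(range(fold, n, folds))  # every folds-th row is held out
        train_x = [x[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        slope = ridge_slope(train_x, train_y, penalty)
        total += sum((y[i] - slope * x[i]) ** 2 for i in test_idx)
    return total / n


def best_penalty(x, y, grid, folds=10):
    """Pick the penalty weight with the lowest cross-validated error."""
    return min(grid, key=lambda penalty: cv_error(x, y, penalty, folds))


# Simulated data with a true slope of 2, for illustration only.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [2.0 * a + rng.gauss(0, 1) for a in x]
grid = [0.0, 0.1, 1.0, 10.0, 100.0]
chosen = best_penalty(x, y, grid)
print(chosen, round(ridge_slope(x, y, chosen), 3))
```

Each candidate penalty is scored only on rows that were held out when the coefficient was fitted, which is what makes the selected weight robust to changes in the training data.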
Torsten and Robin, please input here the model final form and discussion of its features
Torsten and Robin, please input here the model diagnostics results
Random Forest is an ensemble machine learning algorithm which consists of generating multiple decision trees based on random sampling of the data and of the predictor variables, and then combining their output. Each explanatory variable is partitioned into classes, and each decision tree is built by splitting the observations according to the classes of a randomly selected set of variables. The user decides how many variables are used in constructing each decision tree. The construction of the classes accommodates both categorical and numerical variables (continuous or discrete).
The random sampling serves to de-correlate the trees; averaging them subsequently reduces the variance and helps avoid overfitting. The user decides how many trees are used for model averaging. The random forest model needs to be tuned with respect to the number of decision trees and the number of variables randomly sampled at each stage.
In our project, we used the random forest model as a comparison to the linear regression model with respect to the selection of explanatory variables, estimation errors and model fit.
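The mechanics described above (bootstrap sampling of rows, random sampling of predictors, averaging of trees) can be sketched with a toy ensemble. To stay short it uses depth-one trees (stumps) rather than full decision trees, so it is a simplified stand-in for a real random forest. The data and all names, including the two tuning parameters `n_trees` and `n_features`, are illustrative assumptions.

```python
import random
import statistics


def fit_stump(rows, targets, features):
    """Best single split (feature, threshold) by squared error among `features`."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows})[1:]:
            left = [t for row, t in zip(rows, targets) if row[f] < threshold]
            right = [t for row, t in zip(rows, targets) if row[f] >= threshold]
            lm, rm = statistics.fmean(left), statistics.fmean(right)
            sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold, lm, rm)
    if best is None:  # no split possible: always predict the sample mean
        mean = statistics.fmean(targets)
        return (0, float("inf"), mean, mean)
    return best[1:]


def predict_stump(stump, row):
    f, threshold, left_mean, right_mean = stump
    return left_mean if row[f] < threshold else right_mean


def fit_forest(rows, targets, n_trees=50, n_features=1, seed=0):
    """Fit `n_trees` stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]            # bootstrap rows
        features = rng.sample(range(len(rows[0])), n_features)   # random predictors
        forest.append(fit_stump([rows[i] for i in sample],
                                [targets[i] for i in sample], features))
    return forest


def predict_forest(forest, row):
    """Average the predictions of all trees."""
    return statistics.fmean([predict_stump(stump, row) for stump in forest])


# Toy data: [square feet, walking minutes] and price in thousands.
rows = [[500, 5], [650, 8], [800, 12], [1500, 25], [2000, 30], [2600, 35]]
targets = [400, 450, 520, 900, 1100, 1400]
forest = fit_forest(rows, targets)
print(round(predict_forest(forest, [500, 5])), round(predict_forest(forest, [2600, 35])))
```

Increasing `n_trees` smooths the averaged prediction, while `n_features` controls how strongly the individual trees are de-correlated.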
One can observe that the distance to subway, number of bathrooms and average district income occur across all random forest models.
For detached houses, district scores on community, education and transit appear in addition to the above.
For condos, district scores on shopping and health appear in addition to the above.